Abstract—Semantic segmentation is a technique in Computer Sciences
(CS) to extract information from images. With recent advances in
Artificial Intelligence, particularly in Deep Learning, semantic
segmentation combined with techniques such as convolutional neural
networks has presented better results than classical approaches. As a
consequence, there has been an increase in research articles in Remote
Sensing that propose using deep learning-based semantic segmentation
to extract information from satellite or airborne imagery. In this
paper, we survey the state of the art of semantic segmentation in
Remote Sensing from 2010 until 2020 by identifying the research topics
and the number of publications and citations. Furthermore, we point
out the fundamental algorithms, the main convolutional neural network
architectures and backbones, and the most used evaluation metrics. In
addition, we highlight some datasets, as well as some frameworks that
can be used to train semantic segmentation deep neural networks.
Finally, we show some applications of the showcased techniques and
conclude the paper by pointing out some research opportunities in
Remote Sensing semantic segmentation, based on some bleeding-edge
scientific papers published in 2020 in CS.


I. INTRODUCTION


The extraction of information from remote sensing images
has been an active research field, with essential
applications for urban planning, urban dynamics modeling,
and disaster damage assessment. Semantic segmentation is
the process of assigning a label to each pixel of an image
and decomposing a scene into semantically meaningful
regions [1]. Traditionally, semantic segmentation is
performed with either pixel-wise or object-based
approaches. The latter is known as Geographic Object-Based
Image Analysis (GEOBIA) [2] and usually outperforms the
former. These approaches typically consist of two separate
steps: segmentation followed by classification. Because the
second step's accuracy usually relies on the first step's
quality, image segmentation is critical for GEOBIA.


However, image segmentation is not a trivial task,
given that most algorithms rely on subjective and arbitrary
parameter settings. The incorrect choice of parameters may
lead to undesired results, such as under-segmentation and
over-segmentation, which may impact the classification
accuracy. Moreover, the generalization capability of
segmentation techniques is limited because they cannot
deal with the complexity of the objects present in an image.
For example, a given set of parameters can provide good
segmentation results in homogeneous regions (e.g.,
agricultural fields) and unsatisfactory results in
heterogeneous areas like urban environments.
Thus, image analysts usually try several parameter
combinations to achieve a suitable outcome for an entire 
scene, a time-consuming task. Adaptive segmentation 
algorithms were proposed to deal with the diversity of 
image objects [3, 4] or automatic tuning of segmentation 
parameters [5, 6]. However, these methods are complex, 
rely on human-made reference images, and are designed 
for specific applications. 


Recently, improvements in computation power and 
parallel processing algorithms using graphics processing 
units (GPUs) favored the development of deep learning 
(DL) [7, 8]. In particular, convolutional neural networks
(CNNs), a type of DL method introduced by [9], have 
become exceedingly popular for classification, object 
localization, and semantic segmentation of remote sensing 
images [10]. CNNs are designed to automatically extract 
spatial patterns (e.g., shapes, edges, texture) of images 
using a set of convolutions and pooling operations, hence 
learning object-specific characteristics in an end-to-end 
fashion. 


Particularly in the context of semantic segmentation,
neural networks have achieved outstanding results [11, 12,
13, 14, 15, 16, 17, 18]. Unlike traditional pixel-wise
classification, semantic segmentation using CNNs can
preserve the object boundaries, producing a sharp,
fine-scale segmentation. Fully convolutional networks
(FCNs) were the first approach that employed deep networks
for semantic segmentation. The rationale behind FCNs relies
on transforming the fully connected layers into upsampling
or transposed convolutional layers [19] to perform dense
pixel predictions. The pioneering work of [19] adapted
well-known CNN models such as AlexNet for semantic
segmentation tasks.


In semantic Segmentation, the smallest segment can be 
a single pixel, which is not adequate for most applications 
of information extraction using high-resolution remote 
sensing images because, in these images, it is improbable 
to find a target with the dimensions of a single pixel. To 
overcome this problem, instance segmentation, which combines
object detection and semantic segmentation, can be used to
classify an object at the pixel level and outline its exact
shape [20]. Both semantic segmentation and instance
segmentation networks provide the opportunity to
simultaneously detect and classify building footprints
without the need for a previous segmentation step, thus
overcoming the limitations of GEOBIA.


This paper covers the latest state of the art (SOTA)
of semantic segmentation in very high-resolution remote
sensing (RS), focusing only on methods that use
convolutional neural networks (CNNs). We also want to
identify research opportunities in RS by briefly analyzing
the latest trends in CS. To fulfill this goal, this review is
organized as follows: in section 2, we show the SOTA of
semantic segmentation in RS and CS papers; in section 3, we
cover the basic concepts of DL and semantic segmentation
techniques, the primary neural network architectures, the
available datasets and frameworks and finally some raster
to vector methods; and in section 4 we sum up the
concepts presented in this paper, as well as cover the
research opportunities in geosciences based on the
comparison of the SOTA semantic segmentation methods.


II. LITERATURE REVIEW


We conducted a literature review on remote sensing to 
identify the most relevant deep learning techniques and 
methods employed to extract information from remote 
sensing imagery, presented in section 2.1. 


Moreover, to identify possible new techniques from 
computer sciences, we carried out a brief literature survey 
on review articles and also pointed out the best results on 
popular benchmarks showcased on Papers With Code [21], 
shown in section 2.2. 


2.1. Literature Review on Remote Sensing 


To perform our literature review, we searched the
knowledge database SciELO Citation Index (Web of
Science) to investigate the main research topics, the
number of publications per year, and the most cited
papers. This information was used to delineate the most
relevant papers, which we then analyzed further to extract
more helpful information, such as the most popular
methods employed.


The term "Semantic Segmentation" was searched using
the time range 2010-2020 as the filter, which returned
10,145 results. These were filtered once more, considering
only the "Remote Sensing" field, yielding 718 results. To
identify the main research topics, we built a word cloud,
shown in figure 1, with the keywords of these results.
Analyzing the picture, we can infer that the research
conducted from 2010 until 2020 has used neural networks,
particularly convolutional neural networks (CNNs), to
extract or identify features using high-resolution satellite
or aerial imagery. Common ground features extracted by
the considered papers are roads and buildings.

Fig. 1: Word cloud built with the keywords of the
results of the search "Semantic Segmentation" on the Web of
Science database, from 2010 to 2020, considering only
papers in Remote Sensing. Larger words mean more
recurring terms in the research papers' keywords.


During the considered time range, there has been a
nearly exponential growth in the number of remote sensing
papers that cover semantic segmentation, as can
be seen in figure 2. The years 2015 and 2016
presented a slight increase in the number of publications,
which might be a consequence of the papers published in CS,
such as [22, 23, 24]. From 2017 until 2019, there was
a significant increase in the number of research papers,
peaking at 140 in 2019. Since 2020 is not over yet, we can
expect an even more substantial number than 2019, since
the number of research papers published in 2020 is already
much higher than 2018's and only 40% smaller than 2019's.


Fig. 2: Number of publications in Remote Sensing with the
subject Semantic Segmentation from 2010 to 2020
registered on Web of Science.


We further narrowed our chosen papers by
cross-referencing our search results with data from a GitHub
repository (https://github.com/thho/DLinEO_review),
which is under the license CC-BY-4.0 and contains data
used in [1, 25]. Using this information, we considered only
the papers on semantic segmentation, resulting in 261 papers
to analyze. Then, we built the graph in figure 3 to find out
the most popular architecture. We concluded that the most
popular architecture in RS papers is the U-Net, followed by
custom architectures and then Fully Convolutional Networks
(FCNs).


[Figure 3: chart titled "Number of papers per architecture
family". Legend entries: U-Net, custom, FCN, SegNet, ResNet,
Vintage, DeepLab, PSPNet, FuseNet, ensemble, EDSR, RefineNet,
MobileNet-V2, Inception.]

Fig. 3: Papers grouped by architecture family.


Then, to evaluate the backbone usage, we built a word 
cloud shown in figure 4 to find out the most popular 
backbones, and we found out that ResNets, VGG-16, and 
the Inception series are very popular. 


[Figure 4: word cloud. Recoverable terms: ResNet-18, ResNet-34,
ResNet-50, ResNet-101, ResNeXt-50, VGG-16, VGG-19, DenseNet,
DenseNet-121, AlexNet, Xception, Inception-V3, custom.]

Fig. 4: Family architectures used in Semantic
Segmentation papers in Remote Sensing in the considered
papers. Larger names represent more popular family
architecture.


Fig. 5: Tree map representing the backbone distribution for
each type of convolutional neural network architecture
used in the considered papers.


To understand the relationship between the backbones
and the architectures chosen in each paper of the data
analyzed here, we built the tree map shown in figure
5, which leads us to conclude that U-Nets with custom and
ResNet backbones are very popular, followed by custom
backbones with custom architectures, then by the VGG-16
backbone with the FCN architecture, and finally, the VGG-16
backbone with the SegNet architecture.


2.2. Brief Literature Review on Computer Science 


There are several review articles in Computer Sciences
[26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42] that portray the evolution of deep learning-based
semantic segmentation methods. Common research fields
in CS that use the mentioned techniques are
self-driving vehicles [43, 44], pedestrian detection [45, 46]
and computer-aided diagnosis using medical images [47,
48].


The surveyed papers cover similar architectures and 
backbones already listed in section 2.1. The novel backbones that
were not identified in section 2.1 are the ones from the 
EfficientNet family, ResNeSt [49], and SE-ResNet family 
[50]. The training datasets used in CS applications are one 
of the main differences from RS studies. As examples of 
common datasets used in CS, we can cite the Cityscapes 
dataset [51], the PASCAL VOC (PASCAL Visual Object 
Classes Challenge) [52], and its extension, the PASCAL 
Context [39]. 


There is a platform called Papers With Code [21] that
gathers the results of several papers, as well as the code
that is available online to reproduce the considered
papers. On this website, the results of each benchmark are
ranked, and the best models are presented. Some of the
models with the best results on the previously mentioned
datasets are shown in table 1:


Table 1: Best models on some available datasets,
according to Papers With Code [21].


Dataset | Best Model | Paper Title | mIoU
Cityscapes test | HRNet-OCR | Hierarchical Multi-Scale Attention for Semantic Segmentation [53] | 85.1%
PASCAL VOC 2012 test | EfficientNet-L2+NAS-FPN | Rethinking Pre-training and Self-training [54] | 90.5%
PASCAL Context | Channelized Axial Attention (CAA) with Simple decoder (EfficientNet-B7) | Channelized Axial Attention for Semantic Segmentation [55] | 60.5%
Cityscapes val | HRNetV2-OCR+PSA | Polarized Self-Attention: Towards High-quality Pixel-wise Regression [56] | 86.95%

Other techniques worth mentioning, found in the cited
review papers and the research shown in table 1, are
self-training [57], Channelized Axial Attention [55], and
Polarized Self-Attention [56].


III. MAIN CONCEPTS AND METHODOLOGIES IN SEMANTIC SEGMENTATION


From the SOTA review carried out in section 2, we 
identified some of the main concepts and techniques that 
we need to understand when studying semantic 
segmentation techniques applied to remote sensing. 


Furthermore, considering the selected papers and
the ideas highlighted in the SOTA review, we
present some basic concepts in section 3.1, some
techniques for improving training in section 3.2, the main
convolutional neural network backbones in section 3.3, the
main architectures in section 3.4, some applications in RS
and examples of some available datasets in section 3.5,
and finally, some frameworks and tools in section 3.6.


3.1. Main Concepts of Convolutional Neural Networks 


The convolution layer is one of the building blocks of 
Deep Learning. It can be defined as a combination of 
linear and nonlinear operations such as convolution and 
activation functions [58]. 


Convolution is a mathematical operation that applies an 
array of numbers (kernel) to the input, enabling feature 
extraction operations [58]. On the other hand, the 
activation function is a mathematical resource to introduce 
nonlinearities in the convolutional neural networks. Some 
examples of them are the sigmoid function, the hyperbolic 
tangent function, the rectified linear unit (ReLU) [58], the 
leaky rectified linear unit (Leaky ReLU) [59], the 
exponential linear unit (ELU) [60], the scaled exponential 
linear unit (SELU) [61], the Gaussian error linear unit
(GELU) [62], the Mish [63] and the Softmax [64]. Their
mathematical definitions are given, respectively, in
equations 1, 2, 3, 4, 5, 6, 7, 8, and 9. It is worth mentioning
that Softmax is often used as an output function in
convolutional neural networks.





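$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2}$$

$$\mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases} \tag{3}$$

$$\mathrm{LeakyReLU}(x) = \begin{cases} 0.01x & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases} \tag{4}$$

$$\mathrm{ELU}(x) = \begin{cases} \alpha (e^{x} - 1) & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases} \tag{5}$$

$$\mathrm{SELU}(x) = 1.0507 \begin{cases} 1.67326 (e^{x} - 1) & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases} \tag{6}$$

$$\mathrm{GELU}(x) = 0.5x \left( 1 + \tanh\left( \sqrt{\frac{2}{\pi}} \left( x + 0.044715 x^{3} \right) \right) \right) \tag{7}$$

$$\mathrm{Mish}(x) = x \tanh\left( \ln\left( 1 + e^{x} \right) \right) \tag{8}$$

$$\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} \tag{9}$$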
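
As a rough illustration of the activation functions referenced in equations 1 to 9, the sketch below implements them for scalar inputs using only the Python standard library (the hyperbolic tangent is available directly as `math.tanh`). The function names, the default `alpha=1.0` for ELU, and the rounded SELU and GELU constants are assumptions made for this example, not values prescribed by the surveyed papers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return x if x >= 0 else 0.0

def leaky_relu(x, slope=0.01):
    # A small negative slope keeps a non-zero gradient for x < 0.
    return x if x >= 0 else slope * x

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def selu(x, scale=1.0507, alpha=1.67326):
    # Rounded constants; assumed here for illustration.
    return scale * (x if x >= 0 else alpha * (math.exp(x) - 1.0))

def gelu(x):
    # tanh-based approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def mish(x):
    # x * tanh(softplus(x)); math.exp may overflow for very large x.
    return x * math.tanh(math.log1p(math.exp(x)))

def softmax(values):
    # Subtracting the maximum avoids overflow without changing the result.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike the other functions, `softmax` acts on a whole vector of scores and returns values that sum to one, which is why it is typically used as an output function.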


The difference between filters that use convolutions
(common in image processing tasks) and the convolutional
layers of CNNs is that, instead of applying a
pre-determined kernel to the input, a convolutional layer
learns the best parameters of the kernel to extract features
during the training process [33, 39, 34].
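
To make the contrast concrete, the sketch below applies a pre-determined kernel (a Sobel filter, chosen here only as a familiar image processing example) to a tiny image. In a CNN the same sliding-window operation is used, but the kernel values would be learned during training. The helper name `conv2d_valid` and the toy image are assumptions for this illustration. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) without padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output

# Fixed, hand-designed kernel that responds to vertical edges.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

edges = conv2d_valid(image, SOBEL_X)
```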


Another critical concept in CNN theory is the pooling
layer, which replaces a small neighborhood of a feature
map with some statistical information, such as mean or
max [39]. This process is vital because it sub-samples
the feature maps, reducing their dimensionality,
introducing translation invariance to small shifts and
distortions, and decreasing the number of learnable
parameters in the subsequent layers [58].
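
A minimal sketch of a pooling layer is shown below, assuming a 2x2 window with stride 2 (a common but not mandatory choice). The names `pool2d`, `reducer` and `mean` are hypothetical helpers introduced for this example.

```python
def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Replace each size x size neighborhood by a single statistic."""
    output = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reducer(window))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]

max_pooled = pool2d(fmap)                 # 4x4 input becomes 2x2
avg_pooled = pool2d(fmap, reducer=mean)
```

Each output value summarizes four input values, so the feature map shrinks by a factor of four, and a shift of one pixel inside a window does not change the max-pooled result.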


The combination of convolutional layers, activation 
functions, and pooling operations is usually called 
Convolutional Backbone, and its role is to extract high- 
level features [1]. 


Usually, a CNN used to classify an image is composed 
of input, the convolutional backbone, and a classifier head. 
This last one is typically composed of fully connected 
artificial neural networks (ANN), which have several 
perceptrons connected among each other. 


The process of finding the best weights of the neural
network has two steps: a forward stage and a backward
stage [27]. According to [27], the first step uses the current
weights and biases of the network to process the input and
calculate a prediction. Then this prediction is compared to
the expected output (ground truth) with a function called
loss. After determining the loss, the gradients of the loss
with respect to each parameter are computed in the backward
stage using the chain rule, a method called backpropagation
[9], and the parameters are updated accordingly.
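
The two stages can be sketched on the smallest possible "network", a single neuron `y = w * x + b` trained with a squared-error loss. This toy model, the learning rate and the single training sample are assumptions chosen to keep the chain rule visible; real CNNs apply the same idea layer by layer.

```python
def train_step(w, b, x, y_true, lr=0.05):
    # Forward stage: compute the prediction and the loss.
    y_pred = w * x + b
    loss = (y_pred - y_true) ** 2

    # Backward stage: chain rule, d(loss)/dw = d(loss)/d(y_pred) * d(y_pred)/dw.
    dloss_dpred = 2.0 * (y_pred - y_true)
    grad_w = dloss_dpred * x
    grad_b = dloss_dpred

    # Gradient descent update of the parameters.
    return w - lr * grad_w, b - lr * grad_b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, y_true=4.0)
    losses.append(loss)
```

With these values the prediction error halves at every step, so the recorded losses decrease monotonically towards zero.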


The objective of the training process is to minimize the
loss function, which means that the outputs of the trained
neural networks are similar to the ground truth. To carry
out the training, the weights of the neural network need to
be initialized, and the way they are set can impact the
training time.

According to [65], two popular initialization methods are
Glorot (a.k.a. Xavier initialization) [66] and He (a.k.a.
Kaiming initialization) [67]: the first has as its primary
goal to achieve faster convergence and better accuracy by
scaling the neural network weights so that the variance of
the input is equal to the variance of the output [65]; the
second aims to achieve depth-independent performance by
modifying the scaling factor to account for rectifier
nonlinearities [65]. The weights of a neural network can
also be initialized from a previously trained network, a
technique that is known as transfer learning. [68] defines
four types of transfer learning: instance-based,
mapping-based, network-based, and adversarial-based.
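
The sketch below illustrates the two initialization schemes using their commonly cited formulas: a uniform Glorot variant with limit sqrt(6 / (fan_in + fan_out)) and a normal He variant with standard deviation sqrt(2 / fan_in). The function names and the choice of these particular variants are assumptions for the example; libraries offer several versions of each.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    # Keeps the variance of activations roughly constant across layers.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    # The factor 2 compensates for ReLU zeroing about half of the inputs.
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)  # fixed seed so the example is reproducible
w_glorot = glorot_uniform(64, 32, rng)
w_he = he_normal(256, 128, rng)
```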


To achieve convergence faster during the training process,
some algorithms with adaptive learning rates can be
used. In neural network studies, these algorithms are
usually gradient-based and are called optimizers [69].
Some examples of them are Stochastic Gradient Descent
(SGD) [70], AdaGrad [71], Nesterov Accelerated Gradient
(NAG) [72], Adaptive Moment Estimation (Adam) [73],
Rectified Adam (RAdam) [74], Adaptive and Momental
Bound (AdaMod) [75] and Adaptive Second Order
(AdaHessian) [76].
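
As one example of an adaptive learning rate, the sketch below follows the usual AdaGrad update, in which each step is divided by the square root of the accumulated squared gradients. It minimizes the toy function f(x) = (x - 3)^2. The function name, the objective and the hyperparameter values are assumptions made for this illustration.

```python
import math

def adagrad_minimize(grad_fn, x0, lr=1.0, eps=1e-8, steps=100):
    x = x0
    grad_sq_sum = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        grad_sq_sum += g * g
        # The effective step size shrinks as gradients accumulate.
        x -= lr * g / (math.sqrt(grad_sq_sum) + eps)
    return x

# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3.
x_min = adagrad_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Plain SGD would use the same fixed `lr` at every step. The other optimizers listed above differ mainly in how they rescale or smooth the gradient before applying it.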


Regarding loss functions, [77] summarizes some of the
available ones that are usually chosen for semantic
segmentation tasks. Among those, it is worth mentioning
the ones that are commonly used in semantic segmentation
papers: the Cross-Entropy (CE) [78], the Weighted
Cross-Entropy (WCE) [79], the Dice [80], the IoU/Jaccard [81],
the Tversky [82] and the Focal Tversky [83]. The
mathematical formulation of each cited loss function is
described respectively in equations 10, 11, 12, 13, 14,
and 15, where N is the number of pixels, C is the number of
classes, w_c is the weight of class c, g_i^c is the binary
indicator of whether the class label c is the correct
classification for pixel i, s_i^c is the corresponding
predicted probability, α and β are hyperparameters used to
control the balance between false positives and false
negatives, and γ is a coefficient in the interval [1,3].
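
A minimal sketch of the overlap-based losses for the binary case is given below, with `g` the ground-truth indicators and `s` the predicted probabilities, both flattened to plain lists. Several details are assumptions for this example: the Dice denominator uses squared terms (one common variant), a small `eps` avoids division by zero, and the default values of `alpha`, `beta` and `gamma` are illustrative only.

```python
def dice_loss(g, s, eps=1e-7):
    intersection = sum(gi * si for gi, si in zip(g, s))
    denominator = sum(gi * gi for gi in g) + sum(si * si for si in s)
    return 1.0 - 2.0 * intersection / (denominator + eps)

def tversky_index(g, s, alpha=0.5, beta=0.5, eps=1e-7):
    true_pos = sum(gi * si for gi, si in zip(g, s))
    false_pos = sum((1 - gi) * si for gi, si in zip(g, s))
    false_neg = sum(gi * (1 - si) for gi, si in zip(g, s))
    # alpha weights false positives, beta weights false negatives.
    return true_pos / (true_pos + alpha * false_pos + beta * false_neg + eps)

def focal_tversky_loss(g, s, alpha=0.7, beta=0.3, gamma=4.0 / 3.0):
    # gamma in [1, 3]; larger values flatten the loss for well-segmented cases.
    return (1.0 - tversky_index(g, s, alpha, beta)) ** (1.0 / gamma)

truth = [1, 1, 0, 0]
perfect = [1.0, 1.0, 0.0, 0.0]
inverted = [0.0, 0.0, 1.0, 1.0]
```

A perfect prediction gives losses close to zero, while a fully inverted prediction gives a loss of one, which matches the intent of these functions as overlap measures.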


Some metrics can be used to evaluate the quality of the
trained neural networks. According to [84], overall
accuracy (OA), precision, recall, and the F1 index are
helpful for evaluating the quality of the training, and they
are defined by the following equations:


$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{17}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{18}$$

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{19}$$


where TP, TN, FP, and FN are, respectively, the true 
positives, the true negatives, the false positives, and the 
false negatives. 
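
For illustration, these four metrics can be computed directly from the confusion counts, as in the sketch below. The function name and the example counts are hypothetical, and the code does not guard against empty denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "OA": overall_accuracy,
        "precision": precision,
        "recall": recall,
        "F1": f1,
    }

# Hypothetical counts for a binary building / background mask.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

Here the overall accuracy is 0.85 while the recall is only 0.80, which shows why accuracy alone can hide missed detections when the background class dominates.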


According to [31], the Jaccard Index, also known as
intersection over union (IoU), can be defined by:


$$IoU = J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{20}$$


where A and B are, respectively, the ground truth and the
predicted data.


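$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c \log s_i^c \tag{10}$$

$$L_{WCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} w_c \, g_i^c \log s_i^c \tag{11}$$

$$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c s_i^c}{\sum_{i=1}^{N} \sum_{c=1}^{C} (g_i^c)^2 + \sum_{i=1}^{N} \sum_{c=1}^{C} (s_i^c)^2} \tag{12}$$

$$L_{IoU} = 1 - \frac{\sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c s_i^c}{\sum_{i=1}^{N} \sum_{c=1}^{C} \left( g_i^c + s_i^c - g_i^c s_i^c \right)} \tag{13}$$

$$L_{Tversky} = \frac{\sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c s_i^c}{\sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c s_i^c + \alpha \sum_{i=1}^{N} \sum_{c=1}^{C} (1 - g_i^c) s_i^c + \beta \sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c (1 - s_i^c)} \tag{14}$$

$$L_{FT} = \left( 1 - L_{Tversky} \right)^{1/\gamma} \tag{15}$$


Also, according to [31], the mean intersection over
union index (mIoU) can be defined by:


$$mIoU = \frac{1}{m} \sum_{c=1}^{m} \frac{|A_{pred} \cap A_{true}|}{|A_{pred} \cup A_{true}|} \tag{21}$$


where m is the number of expected classes and, for each
class, A_pred is the prediction set and A_true is the
ground truth set.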
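
The sketch below computes the IoU per class and the mIoU from two flattened label maps. The helper names and the toy labels are assumptions for this example. Classes that appear in neither map are skipped, which is one of several possible conventions.

```python
def iou(pred, truth, cls):
    intersection = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    # None signals that the class is absent from both maps.
    return intersection / union if union else None

def mean_iou(pred, truth, classes):
    scores = [iou(pred, truth, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)

truth = [0, 0, 1, 1, 2, 2]  # flattened ground-truth labels
pred = [0, 1, 1, 1, 2, 0]   # flattened predicted labels

per_class = [iou(pred, truth, c) for c in (0, 1, 2)]
miou = mean_iou(pred, truth, classes=(0, 1, 2))
```

In this toy case the per-class IoU values are 1/3, 2/3 and 1/2, so the mIoU is 0.5.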


3.2. Convolutional Neural Networks Training 
Improving Techniques 


Convolutional neural networks usually take a long time to
train, even when using a GPU. This occurs due to the
large number of weights that have to be adjusted in
the process of backpropagation: the larger the number of
parameters of the model, the longer it will take to train.
This can be mitigated by using distributed training on
several GPUs and increasing the batch size.


In addition, the time spent on the training process also
depends on the number of samples that the training dataset
has. On the one hand, if the training dataset does not
contain enough representative images, the neural network
will not "see" a significant number of patterns to learn and
will perform poorly even on the training dataset. This
below-average learning is known as underfitting. On the
other hand, if the number of images is small relative to the
capacity of the network, the neural network can
memorize the data and perform well on the training
dataset, but poorly on the test dataset, which is known as
overfitting [64, 85].


Moreover, the performance on test datasets can be 
improved by using regularization techniques, which are 
defined by [64] as any modification made to a learning 
algorithm that is intended to reduce its generalization error 
but not its training error. Some examples of regularization 
techniques are weight decay, label smoothing, early 
stopping, dropout, batch normalization, and data 


augmentation. Each of these is described below: 


• Weight decay (a.k.a. L2 regularization) is a
method that adds to the loss to be minimized a penalty
based on the L2 norm of the weights, which drives
the weights of the neural network towards smaller
values [64].


• Label smoothing [86, 64] is a technique that adds
noise to the labels, mitigating the effect of
incorrect labels that the dataset may have. It also has
the advantage of preventing the pursuit of hard
probabilities without discouraging correct
classification [64].


• Early stopping consists of stopping the training
when the neural network stops learning, in other
words, when the validation metrics stop improving
[64].


• Dropout [87] is a technique used to reduce the
dependency of the network on specific neurons. At
each training step, a random value is drawn for each
neuron, and if it is larger than the set threshold,
the neuron is turned off (outputs zero). This has a
regularizing effect since it forces the network to
learn patterns with the other connected neurons.


• Batch Normalization [88] is a model
reparameterization technique that introduces both
additive and multiplicative noise on the hidden units
at training time by normalizing the inputs of a layer
to zero mean and unit variance [64].


• Data augmentation is a technique that uses image
manipulation to create new training samples [64,
89]. Common data augmentation operations are
random crop, random flip, and random color jitter.
Furthermore, a novel data augmentation technique
that has been recently employed in CS papers is
Mixup [90], which consists of building synthetic
images composed of a weighted sum of random
pairs of the training data. According to [64, 89], data
augmentation also has a regularizing effect, and it
may contribute to avoiding overfitting. One step
further on data augmentation is using self-supervised
techniques to learn from data the augmentation
procedures that can achieve better metrics. As
examples of such methods, we can cite
AutoAugment [91], Faster AutoAugment [92], and
RandAugment [93].
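
Two of the techniques above are simple enough to sketch in a few lines. The code assumes one-hot labels and flattened images stored as plain lists. The names `smooth_labels`, `mixup` and `sample_mixup_weight`, as well as the default values of `eps` and `alpha`, are assumptions for this illustration.

```python
import random

def smooth_labels(one_hot, eps=0.1):
    # Move a fraction eps of the probability mass to a uniform distribution.
    num_classes = len(one_hot)
    return [(1.0 - eps) * v + eps / num_classes for v in one_hot]

def mixup(x1, y1, x2, y2, lam):
    # Weighted sum of two training samples and of their labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

def sample_mixup_weight(rng, alpha=0.2):
    # Mixup draws the mixing weight from a Beta(alpha, alpha) distribution.
    return rng.betavariate(alpha, alpha)

smoothed = smooth_labels([0.0, 1.0, 0.0])
mixed_x, mixed_y = mixup([0.0, 0.0], [1.0, 0.0],
                         [4.0, 8.0], [0.0, 1.0], lam=0.25)
lam = sample_mixup_weight(random.Random(0))
```

With `lam=0.25` the mixed sample is three quarters of the second image, and its label carries the same proportions, so the network is trained on soft targets.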


Furthermore, there is another approach to training 
optimization, which is the usage of Learning Rate 
Scheduling [94]. This technique changes the value of the 
learning rate according to some heuristic to try to improve 
the neural network accuracy and reduce training time [95, 
96]. Some examples are Time Based Exponential Decay 
[97], Exponential Decay [98], Linear Warmup, Cosine 
Annealing [96], Cosine Power Annealing [99], and One- 
Cycle Learning Rate Scheduling Policy [100]. 
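
As an example of such a heuristic, the sketch below combines a linear warmup with cosine annealing. The function name, the number of warmup steps and the base learning rate are assumptions chosen for illustration rather than recommended settings.

```python
import math

def lr_at(step, total_steps, base_lr=0.1, warmup_steps=5, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: smooth decay from base_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(step, total_steps=100) for step in range(101)]
```

The schedule rises during the first five steps, reaches `base_lr`, and then decays along a half cosine until it reaches `min_lr` at the last step.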


Finally, the last technique for improving training that we
will cover is Stochastic Weight Averaging (SWA) [101,
102], a procedure that averages multiple points along the
trajectory of Stochastic Gradient Descent (SGD), using
specific learning rate schedules that can be either
cyclical or constant. The usage of this technique can help
the optimizer to find a better region of the loss landscape,
which might lead to better generalization.
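
The averaging step itself can be sketched as a running mean over weight snapshots collected along the SGD trajectory. The function names and the three toy snapshots below are assumptions for this example. In practice the snapshots would be taken at the end of each learning rate cycle or at regular intervals.

```python
def swa_update(average, new_weights, num_averaged):
    # Running mean: avoids storing every snapshot.
    return [(a * num_averaged + w) / (num_averaged + 1)
            for a, w in zip(average, new_weights)]

def swa_average(snapshots):
    average = list(snapshots[0])
    for count, weights in enumerate(snapshots[1:], start=1):
        average = swa_update(average, weights, count)
    return average

# Hypothetical weight vectors saved at three points of the trajectory.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swa_weights = swa_average(snapshots)
```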


3.3. Main Convolutional Neural Network Backbones
used on Semantic Segmentation Tasks


In this subsection, we briefly present the key ideas
regarding the main convolutional neural networks used to
perform semantic segmentation tasks in RS. From our
bibliographic research carried out in section 2.1, we
analyzed the results shown in figures 3 and 5, and then we
identified key backbones to be explained in this section.
The chosen backbones were AlexNet [22], ZFNet [23],
GoogLeNet [24], VGG-19 [24], the ResNet family [103],
Inception [86, 104], Xception [105] and MobileNet [106, 107,
108]. From the bibliographic research done in Computer
Sciences, we came across the following backbones that are
worth mentioning: ResNeXt, ResNeSt, and EfficientNet.


According to [1, 109], convolutional neural networks
(CNNs) were introduced by [9], and in 2012, [110] used
them in a model called AlexNet to win the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) [22].
According to [8], in 2013 and 2014, ILSVRC was also
won by CNNs, with models respectively called ZFNet
[23] and GoogLeNet [24]. [1] defines the architectures
AlexNet [110], ZFNet [23] and VGG-19 [24] as Vintage
Architectures.


In 2015, the family of architectures called ResNets
[103] introduced skip connections to address the
vanishing/exploding gradient problem [66, 111], which
prevented deep neural networks from having a large number
of layers. Due to this idea, deeper models were possible,
and the 2015 ILSVRC was won by a ResNet-152. The
ResNet family has the ResNet block as its basic building
block, a series of stacked convolutions and activations.
At the end of the block, the block input is added
element-wise to the output of the stacked layers (the skip
connection) to preserve some of the input information.
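
A minimal sketch of the idea is given below, using small fully connected layers in place of convolutions to keep the code short. In the ResNet formulation of [103] the skip connection is an element-wise addition of the block input to the output of the stacked layers. The helper names and the toy weights are assumptions for this illustration.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(vector, weights):
    # One output per weight row: a stand-in for a convolution.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, w1, w2):
    hidden = relu(linear(x, w1))
    stacked = linear(hidden, w2)
    # Skip connection: add the block input to the stacked-layer output.
    return relu([s + xi for s, xi in zip(stacked, x)])

zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

passthrough = residual_block([1.0, 2.0], zeros, zeros)
doubled = residual_block([1.0, 2.0], identity, identity)
```

With all weights at zero the block still passes its input through unchanged, which is what lets gradients flow through very deep stacks of such blocks.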


To further push the boundary regarding the
performance of the ResNet family-based algorithms, [86,
104] developed a family of architectures called Inception,
which has as its basic block the inception block. Different
from ResNet blocks, which only add the input of the
block to its output, the inception block has several
parallel outputs: each output is the result of a different
stacking of convolutions and pooling operations. Further
advances on this idea were also proposed by the Xception
family [105] and the MobileNet family [106].


Later, [112] evolved the idea of the inception block by
proposing a backbone called ResNeXt: in this method, a
cardinality value is introduced for the blocks, which widens
the block with more branches of stacked convolutions,
enabling further representation learning. Other backbone
architectures that are worth mentioning are the SE-ResNet
[50] and the ResNeSt [49]. The first method proposes the
usage of an attention mechanism at the beginning and the
end of the ResNet block, composing the
Squeeze-and-Excitation block, which performs dynamic
channel-wise feature recalibration to improve the
representational power of the network. The latter method
proposes the usage of the Split-Attention block, which adds
the same idea of cardinality to the SE-Net block proposed
by [50].


Recently there have been some breakthrough
architectures using Neural Architecture Search (NAS)
[113, 114, 115], a technique, often based on reinforcement
learning, that searches for the best architecture to perform
tasks such as object detection and semantic segmentation
[1]. Using NAS techniques, in late 2019, researchers at
Google created a series of backbones called EfficientNet
[116]. In 2020, another group from Google published a paper
called EfficientDet: Scalable and Efficient Object
Detection [117], in which they built on EfficientNets and
proposed a weighted bi-directional feature pyramid
network (BiFPN). According to [117], with these
improvements, the research team achieved networks that are
4x smaller and use 13x fewer FLOPs, with a gain of 0.2%
in mean average precision (mAP) over the previous
state of the art on the COCO dataset.


3.4. Main Convolutional Neural Network Architectures 
Used on Semantic Segmentation Tasks 


In neural network applications, the convolutional
backbone is often combined with other structures
depending on the task that we want to perform. It can be
used with a head such as fully connected layers to
perform classification. In the case of semantic
segmentation, there are some approaches, such as using
naive decoders and encoder-decoder structures [1]. There
are also Generative Adversarial Network (GAN) [39, 118,
119] and Recurrent Neural Network (RNN) with Long
Short-Term Memory (LSTM) [30] approaches to perform
semantic segmentation tasks, but we will not cover those
techniques in this paper. More information on those
techniques can be found in [1, 30, 42].


Naive decoders normally use a convolutional backbone 
and trained deconvolutional layers to perform the 
upsampling task to generate the segmentation mask, 
combined with some interpolation method such as bilinear. 
Some examples of this type of architecture are Fully 
Convolutional Networks (FCN) [120], DeepLabV1 [121], 
DeepLabV2 [122], ParseNet [123], PSPNet [124] and 
DeepLabV3 [125]. 


Encoder-decoder models, in contrast to naive decoders, 
do not rely on an interpolation method to upsample the 
feature maps. Instead, they use a more complex decoder, 
with shortcuts or skip connections that carry information 
from the encoder to the decoder, and perform the 
upsampling gradually [1]. Examples of this type of model 
are DeconvNet [126], SegNet [127], U-Net [79], U-Net++ 
[128], DoubleU-Net [129], MultiResUNet [130], RefineNet 
[131], and DeepLabV3+ [132]. The architecture of U-Net, an 
encoder-decoder model, is shown in figure 6. 
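

To make the role of the skip connections concrete, the sketch below traces the feature-map shapes through a U-Net-like network. It is a bookkeeping exercise rather than a trainable model, and the input size, the number of base channels, the depth, and the 'same'-padded convolutions are assumptions chosen for illustration.

```python
def unet_shapes(in_size=256, base_channels=64, depth=4):
    """Trace (channels, height, width) through a U-Net-like network.

    Assumptions: 'same'-padded convolutions, 2x2 pooling in the encoder,
    2x upsampling in the decoder, channels doubled at every encoder stage.
    """
    encoder, size, channels = [], in_size, base_channels
    for _ in range(depth):
        encoder.append((channels, size, size))  # kept for the skip connection
        size //= 2                              # pooling halves the resolution
        channels *= 2                           # next stage doubles the channels
    bottleneck = (channels, size, size)

    decoder = []
    for skip_channels, skip_size, _ in reversed(encoder):
        size *= 2                               # upsampling doubles the resolution
        assert size == skip_size                # both maps must align spatially
        upsampled_channels = channels // 2      # up-convolution halves the channels
        concatenated = upsampled_channels + skip_channels  # skip: concatenation
        channels = skip_channels                # convolutions reduce channels again
        decoder.append({
            "concat": (concatenated, size, size),
            "out": (channels, size, size),
        })
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
for stage in decoder:
    print(stage["concat"], "->", stage["out"])
```

At every decoder stage the upsampled map is concatenated with the encoder map of the same resolution, so the decoder recovers spatial detail gradually instead of in a single interpolation step.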


A more recent type of encoder-decoder architecture is 
HRNet (High-Resolution Net) [133], together with its 
variant HRNet-OCR [53]. Both are featured in top 
positions of the Cityscapes benchmark, as shown in table 
1. HRNet aims to maintain high-resolution representations 
at every stage of the process by combining parallel chains 
of convolutions and strided convolutions. 
Object-Contextual Representations (OCR) is an attention 
mechanism [134] that considers the context of a pixel 
instead of the pixel alone. OCR can be combined with 
different backbones, such as ResNet-101 and Xception, and 
different architectures, such as DeepLabV3+, to improve 
segmentation results, as shown by [135]. Combining OCR 
with HRNet yields the HRNet-OCR architecture. 


Another type of attention mechanism that can be 
combined with HRNet is Polarized Self-Attention (PSA) 
[56], whose design has two main operations: polarized 
filtering and enhancement. This type of attention 
mechanism looks not only at spatial features but also at 
channel representations. 


Finally, another set of techniques worth mentioning is 
the usage of EfficientNet backbones with Feature Pyramid 
Networks (FPN), combined with self-training techniques 
such as noisy student, a semi-supervised learning 
technique that improves the training results [57]. Table 1 
shows that the best method on the PASCAL VOC 2012 test 
dataset is an EfficientNet trained with the noisy student 
technique (a.k.a. EfficientNet-L2), combined with an FPN 
architecture and Neural Architecture Search (NAS) [54]. 
On the other hand, the best model on PASCAL Context is 
the combination of a plain EfficientNet-B7 with an 
attention mechanism called Channelized Axial Attention 
(CAA) [55]. 


3.5. Applications on Remote Sensing and Examples of 
Available Datasets 


Deep Learning (DL) plays an important role in today's 
science, particularly in geosciences. Several RS research 
papers, such as [136], [137], and [138], compare classical 
computer vision techniques to DL techniques and show that 
DL can achieve better accuracies. 


DL-based techniques can solve several problems in 
geosciences, among them object detection [139, 140], 
hyperspectral image classification [10, 141], 
super-resolution [142, 143, 144], change detection [145, 
146], and semantic segmentation. 


Regarding Semantic Segmentation [84, 147, 148, 149], 
there are some use cases, such as building footprint 
extraction [11, 12, 150, 13, 14, 15, 16, 17, 18], road 
extraction [151, 152, 153] and land use and land cover 
(LULC) analysis [154, 155]. 


To train neural networks that can solve LULC 
problems, data from the ISPRS Potsdam and Vaihingen 
datasets [156, 157] can be used. They contain airborne 
photogrammetric imagery of Potsdam and Vaihingen, labeled 
with six classes (impervious surfaces, building, low 
vegetation, tree, car, and clutter/background). 
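

Label images of this kind are commonly distributed as color-coded rasters, so a first preprocessing step is to convert each color into a class index. The sketch below shows one way to do it. The color table is an assumption based on the color coding usually reported for this benchmark and should be checked against the dataset documentation, and the function names are hypothetical.

```python
# Assumed color coding of the six classes (verify against the dataset docs).
ISPRS_COLORS = {
    (255, 255, 255): 0,  # impervious surfaces
    (0, 0, 255): 1,      # building
    (0, 255, 255): 2,    # low vegetation
    (0, 255, 0): 3,      # tree
    (255, 255, 0): 4,    # car
    (255, 0, 0): 5,      # clutter/background
}


def rgb_mask_to_indices(rgb_mask, ignore_index=255):
    """Map every RGB pixel of a label image to a class index.

    Colors that are not in the table receive `ignore_index`, so that they
    can be excluded from the loss during training.
    """
    return [
        [ISPRS_COLORS.get(tuple(pixel), ignore_index) for pixel in row]
        for row in rgb_mask
    ]


def class_frequencies(index_mask, num_classes=6):
    """Fraction of labeled pixels per class, useful to spot class imbalance."""
    counts = [0] * num_classes
    for row in index_mask:
        for idx in row:
            if idx < num_classes:  # skips pixels marked with ignore_index
                counts[idx] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Tiny hypothetical 2x2 label image; (1, 2, 3) is not a valid class color.
label_image = [
    [(0, 0, 255), (255, 255, 0)],
    [(1, 2, 3), (255, 255, 255)],
]
indices = rgb_mask_to_indices(label_image)
frequencies = class_frequencies(indices)
```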


Moreover, several open datasets are available online for 
training deep convolutional neural networks that extract 
building footprints. They are listed below, and the 
details are shown in table 2: 


• SpaceNet [158, 159]: dataset with satellite 
imagery of the following cities: Rio de Janeiro, Las 
Vegas, Paris, Khartoum, and Shanghai. 


• Massachusetts [160]: dataset with satellite 
imagery of the city of Boston. 


• WHU building [161]: dataset with airborne 
photogrammetric imagery of New Zealand. 


• INRIA aerial [162]: dataset with satellite imagery 
from the following cities: Austin, Chicago, Kitsap 
County, Western Tyrol, and Vienna. 


• LandCover.ai [163]: dataset with satellite imagery 
of Poland. 


• AIRS [164]: dataset with satellite imagery of 
Christchurch City in New Zealand. 


• CrowdAI [165]: a simplified version of the 
SpaceNet Dataset, with only RGB images. 


Fig. 6: Basic structure of a U-Net. Figure built using https://github.com/HarisIqbal88/PlotNeuralNet. 


Table 2: Comparison between building footprint datasets 


Dataset | # of buildings | # of tiles | Tile size | Spatial resolution 
LandCover.ai | 12,788 | 41 | 33 tiles of 9000 x 9500 px and eight tiles of 4200 x 4700 px | 25 cm and 50 cm 
INRIA | 216,418 | 360 | 5000 x 5000 px | 30 cm 
Massachusetts Buildings | 310,425 | 151 | 1500 x 1500 px | 1 m 
SpaceNet | 462,091 | 17,533 | 512 x 512 px | 35 cm 
WHU building dataset | 220,000 | 25,577 | 512 x 512 px | 7.5 cm [...] cm 
AIRS | 220,000 | 1,047 | 10,000 x 10,000 px | 7.5 cm 
CrowdAI | Unknown | 280,741 training images, 60,317 validation images, and 60,697 test images | 300 x 300 px | Unknown 


3.6. Available Frameworks and Tools 


The two most famous deep learning frameworks are 
TensorFlow [166] and PyTorch [167]. Both are open source, 
have large communities, are very well documented, and 
have outstanding performance. TensorFlow includes a 
high-level API called Keras [168], which enables 
higher-level and more readable code. On the PyTorch side, 
PyTorch Lightning [169], FastAI [170], and Catalyst [171], 
among others, are frameworks that provide improvements 
similar to those given by Keras. 


Regarding openly available segmentation tools, there are 
two frameworks developed in Python that use TensorFlow 
and PyTorch, respectively: segmentation models [172] and 
segmentation models PyTorch [173]. To train segmentation 
models without coding skills, users can build a JSON file 
with the parameters of the training and use a Python 
package called segmentation models trainer [174], which 
was built using TensorFlow, Keras, and segmentation 
models. The authors of [175] have also created a training 
framework using PyTorch and PyTorch Lightning, called 
PyTorch segmentation models trainer, which, instead of 
using a JSON file to set the hyperparameters, uses YAML 
files with configuration composition, enabling users to 
reuse settings. To build training masks from vector data, 
a QGIS [176] plugin called DeepLearningTools [177] can be 
used. 
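

The idea behind configuration composition can be illustrated with a small sketch in plain Python: small reusable fragments (which would live in separate YAML files) are merged into one final configuration, and an experiment only states what it changes. The keys and values below are hypothetical and do not reproduce the actual schema of any of the cited packages.

```python
import copy


def _merge_into(target, source):
    """Recursively copy `source` into `target`; `source` wins on conflicts."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def compose(*fragments):
    """Deep-merge configuration fragments; later fragments override earlier ones."""
    result = {}
    for fragment in fragments:
        _merge_into(result, fragment)
    return result


# Reusable fragments (hypothetical keys, for illustration only).
model_unet = {"model": {"architecture": "unet", "backbone": "resnet34", "classes": 2}}
optimizer_adam = {"optimizer": {"name": "adam", "lr": 1e-3}}

# An experiment reuses both fragments and overrides two settings.
experiment = {"model": {"backbone": "efficientnet-b7"}, "optimizer": {"lr": 1e-4}}

config = compose(model_unet, optimizer_adam, experiment)
print(config)
```

The fragments themselves are left untouched by the merge, so the same model or optimizer definition can be shared by many experiments.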


There are also tools that help to build and inspect 
datasets, such as FiftyOne [178]. With this tool, data 
scientists can visualize the labels overlaid on the images 
and calculate image similarity indexes to assess the quality 
of the dataset and identify missing labels. 


Concerning data augmentation, each library has built-in 
operations. As external options, we can cite 
Albumentations [179], a Python package that is framework 
agnostic and works only on CPU. Another option in the 
PyTorch ecosystem is Kornia [180], a package that works 
on either CPU or GPU. 
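

In semantic segmentation, geometric augmentations must be applied identically to the image and to its mask, otherwise pixels and labels no longer match. The sketch below shows this requirement in plain Python. It does not use the API of the packages cited above, and the function names and the chosen transforms are assumptions made for the example.

```python
import random


def hflip(grid):
    """Mirror a 2D grid horizontally."""
    return [list(reversed(row)) for row in grid]


def rot90(grid):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment(image, mask, rng, p=0.5):
    """Apply the same random flip and rotation to an image and its mask."""
    if rng.random() < p:
        image, mask = hflip(image), hflip(mask)
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        image, mask = rot90(image), rot90(mask)
    return image, mask


# Hypothetical 2x2 sample in which every pixel value is 10 times its label,
# which makes it easy to verify that image and mask stay aligned.
image = [[10, 20], [30, 40]]
mask = [[1, 2], [3, 4]]
rng = random.Random(0)  # seeded generator for reproducible augmentation
aug_image, aug_mask = augment(image, mask, rng)
```

Photometric augmentations, such as brightness or contrast changes, are the exception: they are applied to the image only, since they do not move pixels.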


IV. CONCLUSION 


In this paper, we presented the SOTA of Semantic 
Segmentation in Remote Sensing, an ever-growing field of 
research, with an almost exponential increase in the 
number of publications, as shown in section 2.1. We 
identified that the most used backbones in RS tasks are the 
ResNet family, VGG-16, Inception-V3, and AlexNet. 
Furthermore, we identified that the most famous 
architectures used in RS are U-Net, DeepLabV3+, 
FCN, and SegNet. We also briefly showed the main 
theories, algorithms, and neural network architectures and 
backbones. 


This paper has also briefly presented how 
convolutional neural networks work and the techniques 
used for training such structures, like weight initialization, 
popular optimizers, some of the loss functions available, 
and the often-used metrics in RS papers. We also showed 
some of the existing regularizing techniques such as 
weight decay, label smoothing, early stopping, dropout, 
batch normalization, and data augmentation. 


We then presented some learning rate scheduling 
methods and stochastic weight averaging, listed the most 
famous backbones and architectures found in the RS papers 
surveyed, and presented some applications of such 
techniques in RS. Finally, we showed some available 
datasets and popular frameworks and packages to train 
deep convolutional neural networks. 


There are many research papers in CS that propose 
several neural architectures, and some have been used in 
RS applications. Deep Learning is an ever-growing field, 
and in 2020 there have been many promising and exciting 
new backbones, such as the EfficientNet family, the 
ResNeSt-269 [49], and the SE-ResNet family [50]. 


Moreover, we have identified a research opportunity in 
RS to combine the mentioned backbones with popular 
architectures such as U-Net, FPNs, and PSPNet. Another 
research opportunity is the usage of HRNet-OCR [53], 
HRNetV2-OCR+PSA [56], EfficientNet-B7+CAA [55], 
and EfficientNet-L2+NAS-FPN [54], which are on the 
leaderboard of Papers With Code [21] but were not 
observed in the surveyed papers regarding remote sensing 
applications. 


In addition, another research opportunity that we 
identified is to perform an extensive comparison of the 
accuracy of models trained with several combinations of 
neural network architectures and backbones to define the 
best method to extract information from 
very-high-resolution remote sensing images. We can also 
highlight other research opportunities, such as 
determining the best loss function to be used in training 
and the best inference method to improve validation data 
accuracy. The suggested loss function for such a study is 
the Focal Tversky [83], since it handles class imbalance, 
a common problem in remote sensing datasets, especially 
building footprint extraction datasets. 
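

As a rough illustration of why this loss suits imbalanced data, the sketch below implements a binary Focal Tversky loss in plain Python. The formulation and the default values of alpha, beta, and gamma follow settings commonly quoted for this loss and are assumptions here. The exact definition should be taken from [83].

```python
def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=4 / 3, eps=1e-7):
    """Binary Focal Tversky loss over flattened predictions.

    probs:   predicted foreground probabilities in [0, 1]
    targets: ground-truth labels (0 or 1)
    alpha weights false negatives and beta weights false positives, so
    alpha > beta penalizes missed foreground pixels more strongly.
    """
    tp = sum(p * t for p, t in zip(probs, targets))        # soft true positives
    fn = sum((1 - p) * t for p, t in zip(probs, targets))  # soft false negatives
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))  # soft false positives
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return (1 - tversky) ** (1 / gamma)


targets = [1, 1, 0, 0]  # hypothetical mask with two building pixels
perfect = focal_tversky_loss([1, 1, 0, 0], targets)
one_missed = focal_tversky_loss([1, 0, 0, 0], targets)       # one false negative
one_false_alarm = focal_tversky_loss([1, 1, 1, 0], targets)  # one false positive
print(perfect, one_missed, one_false_alarm)
```

With these settings, missing a building pixel costs more than raising one false alarm, which counteracts the tendency of a model to predict the dominant background class.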


Additionally, even though new optimizers such as 
RAdam, AdaMod, and AdaHessian have been proposed, 
few papers in remote sensing have tested them. The same 
holds for activation functions such as Leaky-ReLU, ELU, 
SELU, GELU, and Mish. So, we also identify research 
opportunities in studying the influence of optimizers and 
activation functions on the training time and the test 
metric scores. 


Finally, other aspects that we did not find in the 
surveyed papers and that can be researched are the usage of 
stochastic weight averaging [101, 102] and of novel 
augmentation techniques such as Mixup [90], 
AutoAugment [91], Faster AutoAugment [92], and 
RandAugment [93]. 